
Jointly Modeling Embedding and Translation to Bridge Video and Language



Abstract

Automatically describing video content with natural language is a fundamental challenge of multimedia. Recurrent Neural Networks (RNN), which model sequence dynamics, have attracted increasing attention on visual interpretation. However, most existing approaches generate a word locally with given previous words and the visual content, while the relationship between sentence semantics and visual content is not holistically exploited. As a result, the generated sentences may be contextually correct but the semantics (e.g., subjects, verbs or objects) are not true. This paper presents a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which can simultaneously explore the learning of LSTM and visual-semantic embedding. The former aims to locally maximize the probability of generating the next word given previous words and visual content, while the latter is to create a visual-semantic embedding space for enforcing the relationship between the semantics of the entire sentence and visual content. Our proposed LSTM-E consists of three components: a 2-D and/or 3-D deep convolutional neural network for learning a powerful video representation, a deep RNN for generating sentences, and a joint embedding model for exploring the relationships between visual content and sentence semantics. The experiments on the YouTube2Text dataset show that our proposed LSTM-E achieves to-date the best reported performance in generating natural sentences: 45.3% and 31.0% in terms of BLEU@4 and METEOR, respectively. We also demonstrate that LSTM-E is superior in predicting Subject-Verb-Object (SVO) triplets to several state-of-the-art techniques.
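
As a rough illustration of the two objectives named in the abstract, a coherence term that locally maximizes the next-word probability and a relevance term defined in a shared visual-semantic embedding space, the Python (PyTorch) sketch below feeds a mean-pooled video feature into an LSTM decoder and combines the two losses with a trade-off weight. This is not the authors' code: all layer sizes, the mean-pooled sentence encoding, and the weight alpha are illustrative assumptions rather than values from the paper.

# Minimal sketch (assumptions noted above) of the LSTM-E joint objective:
# coherence = next-word cross-entropy from an LSTM conditioned on the video;
# relevance = distance between video and sentence embeddings in a shared space.
import torch
import torch.nn as nn


class LSTME(nn.Module):
    def __init__(self, vocab_size, video_dim=4096, word_dim=512,
                 embed_dim=512, hidden_dim=512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, word_dim)
        # Projections into the shared visual-semantic embedding space.
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.sent_proj = nn.Linear(word_dim, embed_dim)
        # LSTM decoder; the video embedding is fed as the first input step.
        self.word_in = nn.Linear(word_dim, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, video_feat, captions):
        # video_feat: (B, video_dim) mean-pooled 2-D/3-D CNN features
        # captions:   (B, T) word indices, captions[:, 0] being <bos>
        v = self.video_proj(video_feat)                       # (B, E)
        words = self.word_embed(captions[:, :-1])             # (B, T-1, word_dim)
        # Prepend the video embedding, then decode the sentence.
        inputs = torch.cat([v.unsqueeze(1), self.word_in(words)], dim=1)
        hidden, _ = self.lstm(inputs)
        logits = self.out(hidden[:, 1:])                      # predictions after <bos>
        # Sentence embedding: mean of word embeddings (a simplification).
        s = self.sent_proj(words.mean(dim=1))                 # (B, E)
        return logits, v, s


def lstme_loss(logits, targets, v, s, alpha=0.5, pad_idx=0):
    # Weighted sum of the relevance (embedding) and coherence (next-word) losses.
    coherence = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1),
        ignore_index=pad_idx)
    relevance = (nn.functional.normalize(v, dim=1)
                 - nn.functional.normalize(s, dim=1)).pow(2).sum(dim=1).mean()
    return alpha * relevance + (1 - alpha) * coherence

In this sketch, training would simply backpropagate lstme_loss(logits, captions[:, 1:], v, s); the balance between the relevance and coherence terms is what couples whole-sentence semantics to the visual content rather than relying on the word-by-word decoder alone.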
